Kickstarter campaigns analysis (2009-2017)

Dataset

The dataset of Kickstarter Campaigns was downloaded from Kaggle (source).

It contains information on over 300,000 Kickstarter projects launched between 2009 and 2017. The dataset contains variables such as:
  • Name - name of the project
  • Category (159 categories) - specific category of the campaign
  • Main Category (15 categories) - general category of the campaign
  • Goal - how much money the project aimed to collect
  • Launch date - when the project was launched
  • Deadline - the deadline of the project
  • Currency - the currency of the project goal
  • Pledged - how much money was pledged to the project, in the specified currency
  • State - whether the campaign was successful, failed, suspended or canceled
  • Backers - number of backers who supported the project with their funds
  • Country - country of the project
  • Usd_pledged_real - how much was pledged, converted to USD
  • Usd_goal_real - the goal of the campaign, converted to USD

Hypothesis and Research question

  • Research question - Do failing campaigns fail because they set higher goals?
  • Hypothesis - There is a significant difference in the mean goal of successful and unsuccessful campaigns.
  • 1a. Is mean usd_goal different in both groups?
  • 1b. Is mean usd_pledged different in both groups?
  • 1c. Is mean number of backers different in both groups?

Loading the data and data preparation

  • Choosing columns
  • Checking if there are NA in the dataset
  • Factorising categorical values
  • Describing the data
# Loading Data, choosing columns
df <- read_csv('ks-projects-201801.csv')

df <- df %>% 
  select(c('name', 'category', 'main_category', 'launched', 'deadline', 
           'state', 'backers', 'country', 'usd_pledged_real', 'usd_goal_real'))

# Checking NA's in the dataset
df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))  # funs() is deprecated; across() is the current idiom
# it seems there are no NA's (only in name, which is irrelevant to this research)

# Factorise categorical variables
df <- df %>%
  mutate(across(c(main_category, category, state, country), as.factor))

Descriptive statistics of the variables in the dataset
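The summary table below matches the output format of `psych::describe`; a minimal sketch of the call that likely produced it (the use of the psych package is an assumption, as the original chunk is not shown):

```r
# Descriptive statistics for every column (psych package assumed)
library(psych)
describe(df)  # starred rows are factors, summarised by their integer codes
```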

##                  vars      n      mean         sd    median   trimmed       mad
## name*               1 378657 187974.22  108445.68 188057.00 187990.38 139271.00
## category*           2 378661     81.74      45.13     88.00     82.31     56.34
## main_category*      3 378661      8.51       3.90      8.00      8.70      4.45
## launched            4 378661       NaN         NA        NA       NaN        NA
## deadline            5 378661       NaN         NA        NA       NaN        NA
## state*              6 378661      2.66       1.13      2.00      2.68      0.00
## backers             7 378661    105.62     907.19     12.00     28.84     17.79
## country*            8 378661     19.85       6.27     23.00     21.32      0.00
## usd_pledged_real    9 378661   9058.92   90973.34    624.33   2082.19    925.63
## usd_goal_real      10 378661  45454.40 1152950.06   5500.00   9399.97   6671.70
##                   min       max     range  skew kurtosis      se
## name*            1.00    375755    375754  0.00    -1.20  176.23
## category*        1.00       159       158 -0.06    -1.23    0.07
## main_category*   1.00        15        14 -0.24    -0.80    0.01
## launched          Inf      -Inf      -Inf    NA       NA      NA
## deadline          Inf      -Inf      -Inf    NA       NA      NA
## state*           1.00         6         5  0.43    -0.96    0.00
## backers          0.00    219382    219382 86.76 13954.68    1.47
## country*         1.00        23        22 -1.70     1.30    0.01
## usd_pledged_real 0.00  20338986  20338986 82.19 11796.26  147.84
## usd_goal_real    0.01 166361391 166361391 78.22  7082.76 1873.64

Distribution of categories

It will not be needed for this research question, but let’s check it out of curiosity.
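The counts printed below can be reproduced with a sketch like this (the exact code is not shown in the original):

```r
# Count distinct specific and general categories (n_distinct is from dplyr)
print(paste('number of specific categories of Kickstarter projects:',
            n_distinct(df$category)))
print(paste('Number of general categories of Kickstarter projects:',
            n_distinct(df$main_category)))
```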

## [1] "number of specific categories of Kickstarter projects: 159"
## [1] "Number of general categories of Kickstarter projects: 15"

Distribution of the state variable


It seems that most of the campaigns either failed or succeeded. I will merge canceled and suspended campaigns into failed and discard all other campaigns to obtain a binary success/failure class.

df <- df %>% 
  mutate(state = case_when(
    state == 'successful' ~ 'successful',
    state %in% c('failed', 'suspended', 'canceled') ~ 'failed'
  ))  # any other state becomes NA (note: a trailing comma inside case_when() is a syntax error in R)

df <- df %>% filter(state %in% c('successful', 'failed'))

df %>% 
  count(state) %>% 
  mutate(percentage = paste0(round(100 * n / sum(n), 2), ' %')) %>%  # no hardcoded total
  ggplot(aes(x='', y=n, fill=state)) + 
  geom_bar(stat="identity", width=1) + 
  coord_polar("y", start=0) + theme_void() + 
  geom_text(aes(y = n, label = percentage), color = "white", size=3, position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values=c('#2bde73',
                             '#2bd9de',
                             "#eeeeee",
                             '#081245',
                             '#122906',
                             'darkgrey')) +ggtitle("Distribution of successful and failed projects")


Great! Now we have a binary variable for each campaign’s final state: success / failure.
Now, let’s analyse the distributions of the numerical variables.

Numerical Variables

First, let’s plot and analyse the distributions of goals for failed and successful projects, together with their pledges and numbers of backers.
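The distribution plots discussed below are not reproduced here; a minimal ggplot2 sketch of the kind of histogram being described (the log scale is an assumption, used to tame the heavy skew):

```r
# Histograms of backers, faceted by state
# (zero-backer projects are dropped by the log scale and produce a warning)
df %>%
  ggplot(aes(x = backers, fill = state)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  facet_wrap(~ state) +
  ggtitle("Distribution of backers by campaign state")
```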


The distribution of backers in successful projects seems normal or close to normal, whereas the distribution in failed projects is positively skewed, which makes sense if unsuccessful campaigns usually attract fewer backers.



The goal variable looks roughly normally distributed, so we should test its normality, and also check whether there is a statistically significant difference between the mean goals and pledges of failed and successful projects, in order to answer our research question.

Normality tests

Normality tests for usd_goal_real, usd_pledged_real and backers: Anderson-Darling test + QQ-plot

usd_goal_real
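A sketch of how these p-values can be obtained (the use of `ad.test` from the nortest package is an assumption):

```r
library(nortest)  # provides ad.test (assumed to be the implementation used)

# Helper returning a rounded Anderson-Darling p-value
ad_p <- function(x) signif(ad.test(x)$p.value, 2)

print(paste('usd_goal_real: Anderson-Darling Normality Test p_value:',
            ad_p(df$usd_goal_real)))
print(paste('usd_goal_real (only successful projects): p_value:',
            ad_p(df$usd_goal_real[df$state == 'successful'])))
print(paste('usd_goal_real (only failed projects): p_value:',
            ad_p(df$usd_goal_real[df$state == 'failed'])))
```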

## [1] "usd_goal_real: Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "usd_goal_real (only successful projects): Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "usd_goal_real (only failed projects): Anderson-Darling Normality Test p_value: 3.7e-24"


usd_goal_real is not normally distributed, as in all cases the p-value is smaller than 0.05.

usd_pledged_real

## [1] "usd_pledged_real: Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "usd_pledged_real (only successful projects): Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "usd_pledged_real (only failed projects): Anderson-Darling Normality Test p_value: 3.7e-24"


usd_pledged_real is not normally distributed, as in all cases the p-value is smaller than 0.05.

backers

## [1] "backers: \n      Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "Backers (only successful projects): Anderson-Darling Normality Test backers: 3.7e-24"
## [1] "Backers (only failed projects): Anderson-Darling Normality Test p_value: 3.7e-24"


None of goal, pledged, or backers is normally distributed, either overall or within groups.

Violin plots for checking the distribution of numerical variables


It seems that there are significant outliers in all three variables (goal, pledged and backers).
Let’s remove these rows from the dataset.

Removing Outliers

# removing usd_pledged_real outliers
# scores() computes z-scores and comes from the outliers package
library(outliers)
pledged_outlier_scores <- scores(df$usd_pledged_real)
df[pledged_outlier_scores > 3 | pledged_outlier_scores < -3, 'usd_pledged_real'] <- NA

# removing usd_goal_real outliers
real_outlier_scores <- scores(df$usd_goal_real)
df[real_outlier_scores > 3 | real_outlier_scores < -3, 'usd_goal_real'] <- NA

# removing backers outliers
backers_outlier_scores <- scores(df$backers)
df[backers_outlier_scores > 3 | backers_outlier_scores < -3, 'backers'] <- NA

# checking for NA's (outliers)
#df %>%
#  summarise_all(funs(sum(is.na(.))))

# Dropping rows containing NA values
dim1 = dim(df)[1]
df <- df %>% drop_na() 
dim2 = dim(df)[1]
paste('Dropped', dim1-dim2, 'outliers')
## [1] "Dropped 2589 outliers"

Differences between means of three variables between 2 groups (success, failure)

first, let’s see violin plots of the three variables grouped by state


This shows us two things:
  • it seems that there might be a (probable) difference in mean goal
  • it seems that there is a difference in mean pledged and mean backers


Variances equality (Levene test)

Before checking the differences between means, it is important to check whether the variances of the groups are equal.
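A sketch of the Levene test call (the use of `leveneTest` from the car package is an assumption; `state` is coerced to a factor because it was recoded to character earlier):

```r
library(car)  # provides leveneTest (assumed)

# Levene test for equality of variances across the two states
lev <- leveneTest(usd_goal_real ~ factor(state), data = df)
print(paste('Levene Test for usd_goal_real variable: Value -',
            lev$`F value`[1], ';P -', lev$`Pr(>F)`[1]))
```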

## [1] "Levene Test for usd_goal_real variable: Value - 4279.81741093291 ;P - 0"
## [1] "Levene Test for usd_pledged_real variable: Value - 33502.3089465996 ;P - 0"
## [1] "Levene Test for backers variable: Value - 37010.2056946422 ;P - 0"


In all cases, the p-value of the Levene test is very close to 0 (smaller than 0.05), which means the variances are not equal. This is an important insight before conducting the t-test: it motivates Welch's t-test, which does not assume equal variances.

T-Test for independent groups

Welch two-sample t-test for the usd_goal_real variable, with ‘state’ as the grouping variable (the output reports a two-sided alternative)
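The output below matches base R’s `t.test`, which defaults to Welch’s unequal-variance test; a sketch of the call:

```r
# Welch two-sample t-test (var.equal = FALSE is the default, which is
# appropriate given the unequal variances found by the Levene test)
t.test(df$usd_goal_real ~ df$state)
```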

## 
##  Welch Two Sample t-test
## 
## data:  df$usd_goal_real by df$state
## t = 92.395, df = 248355, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group failed and group successful is not equal to 0
## 95 percent confidence interval:
##  23377.89 24391.22
## sample estimates:
##     mean in group failed mean in group successful 
##                32031.682                 8147.127

Welch two-sample t-test for the usd_pledged_real variable, with ‘state’ as the grouping variable

## 
##  Welch Two Sample t-test
## 
## data:  df$usd_pledged_real by df$state
## t = -164.56, df = 140512, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group failed and group successful is not equal to 0
## 95 percent confidence interval:
##  -11908.41 -11628.08
## sample estimates:
##     mean in group failed mean in group successful 
##                 1444.783                13213.030

Welch two-sample t-test for the backers variable, with ‘state’ as the grouping variable

## 
##  Welch Two Sample t-test
## 
## data:  df$backers by df$state
## t = -179.5, df = 137880, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group failed and group successful is not equal to 0
## 95 percent confidence interval:
##  -152.1321 -148.8457
## sample estimates:
##     mean in group failed mean in group successful 
##                 17.48571                167.97460


In all cases, the t-test indicated a p-value lower than 0.05, which means that for every variable the means of the two groups (successful / failed) are not equal.

Means visualization

Knowing that the group means differ for every analysed numeric variable, a visualization of the means (including confidence intervals) will be insightful.

usd_goal_real - means of project groups

usd_pledged_real - means of project groups

backers - means of project groups


That’s very interesting: failed campaigns tend to have significantly higher goals, but collect less money and attract fewer backers than campaigns that achieve success.

Let’s check one last thing: how close campaigns get to their goal (over or under it).

Closeness to goal value

Let’s check how far above or below the goal campaigns landed on average. The value is expressed as the fraction of the goal that was pledged (calculated by dividing usd_pledged_real by usd_goal_real).
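A sketch of this computation (the column name `goal_ratio` is hypothetical):

```r
# Fraction of the goal that was pledged, summarised per state
# with an approximate 95% confidence half-width
df %>%
  mutate(goal_ratio = usd_pledged_real / usd_goal_real) %>%
  group_by(state) %>%
  summarise(mean_ratio = mean(goal_ratio),
            ci = 1.96 * sd(goal_ratio) / sqrt(n()))
```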


This is interesting as well! Successful campaigns not only have higher pledges and lower goals than failed campaigns, but they also significantly exceed their goal (mean = 7.41, CI = 1.82), whereas failed campaigns are on average not even close to their goals (mean = 0.29, CI = 0.18).

Results

The analysis of the means of campaign goal, pledged money and number of backers with Welch’s t-test for independent groups revealed that successful Kickstarter campaigns launched between 2009 and 2017 had significantly lower goals (mean = 8147, CI = 74.7) than failed campaigns (mean = 32032, CI = 501). At the same time, successful campaigns had significantly higher average pledged money (mean = 13213, CI = 138) and average number of backers (mean = 168, CI = 1.62) than failed campaigns (pledged: mean = 1445, CI = 25.0; backers: mean = 17.5, CI = 0.25). Additionally, successful campaigns on average raised 741% of their goal (mean = 7.41, CI = 1.82), whereas failed campaigns raised on average 28.6% of the established goal (mean = 0.286, CI = 0.18).

Conclusion

The results of the analysis indicate that successful campaigns are the ones whose owners establish lower goals. Such campaigns not only attract significantly more backers, but also collect significantly more money than unsuccessful campaigns, which have higher goals on average. What is also surprising is that successful campaigns significantly exceed their planned goals, whereas failed campaigns on average don’t get close to the goal. There is a number of possible implications:
  1. More backers believe that campaigns with lower goals will be successful.
  2. Smart campaign creators know this and intentionally set a lower goal, counting on attracting more backers and exceeding it.
  3. The Kickstarter algorithm possibly tends to promote campaigns that are closer to reaching their goal.
  4. Campaigns that fail are usually projects that are very expensive and difficult to fund.


Limitations

  • First, the dataset is highly skewed and contains many extreme values.
  • Second, the numerical variables used in the analysis are not normally distributed.
  • Third, the analysed groups (failed/successful) are not equal in size.

Recommendations for further research

  • Analyse other features of campaigns and their relationship with success
  • Analyse trends in success ratio per time of campaigns and look for seasonality
  • More carefully filter the dataset to avoid high number of extreme observations
  • Create a statistical model for predicting whether campaign will be successful or not

What’s next?

These results made me think more about the dataset. I came up with another hypothesis that can be tested:
Variables such as usd_goal, date of launch, campaign length (30/60 days), category, main_category, and country / region are related to whether a campaign is successful or not.


We already know the distribution of the numerical data, so analysing it again is not needed. Outliers have already been dropped, and the data is already preprocessed (the state variable was made binary).

Categories (main_category variable)

Let’s come back to the categories! There are 15 main categories that can be analysed - let’s do so :)
  • How are categories distributed?
  • Which categories are most successful?
  • Which category was most successful every year?


“Film and Video” is the most frequent project category, whereas “Dance” is the least frequent. I’m curious whether there is a relationship between the frequency of a category and its success rate.


Hmm… interesting: the least frequent categories, like Dance, Theater or Comics, have the highest success rates, higher than 50%! That might indicate a relationship between category (and its frequency) and the state of the project (success / failure).

Let’s check which category has the most generous backers, or rather, which category received on average the largest amount of money from backers.


The “Dance” category is the most successful, but it also has low mean goals and relatively low mean pledged amounts. On the other hand, Technology is the least successful, even though it has the highest mean goal and very high mean pledged amounts. This might indicate a relationship between main category and campaign success.

Most successful category per year


And again, Dance stands out: this time the analysis shows that it was the most successful project category from 2010 to 2014. I wonder if Kickstarter became popular due to the Dance category…

Years

I wonder how the success rate changed over the years and which categories were most popular in that time.


It seems that the average success rate fell drastically after 2013 and hadn’t come back to its prior level by 2017. I wonder what could cause this decrease. Maybe the number of new projects every year? More projects, more failures?


WOW! It seems that the annual number of projects and the annual success rate might be correlated!
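A sketch of the aggregation and correlation test (the yearly grouping and the use of lubridate are assumptions, since the original code is not shown):

```r
library(lubridate)  # year() (assumed)

# Aggregate to one row per launch year: success rate and project count
yearly <- df %>%
  mutate(year = year(as.Date(launched))) %>%
  group_by(year) %>%
  summarise(success_rate = mean(state == 'successful'),
            n_projects = n())

ct <- cor.test(yearly$n_projects, yearly$success_rate)
print(paste("Pearson's correlation value:", ct$estimate,
            ';p-value:', ct$p.value))
```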

## [1] "Pearson's correlation value: -0.840287151309151 ;p-value: 0.00456868893598974"

The Pearson correlation p-value is lower than 0.05, which indicates a significant negative correlation (coefficient -0.84) between the annual number of projects and the annual success rate! That’s a very interesting insight about how Kickstarter functions. #Pattern #Spotted :D

Months

Now, let’s check if there is any trend in months


We can notice significant decreases of the success rate in July, December and January, which might indicate a relationship between the month of the year and project success. Let’s also check whether the average number of projects per month has any relationship with the success rate.


From these two graphs it is difficult to distinguish a relationship, let’s calculate the correlation, just in case :D

## [1] "Pearson's correlation value: 0.300896677207677 ;p-value: 0.341927140320416"


Pearson’s correlation between monthly success rate and monthly number of projects does not indicate a significant relationship, as the p-value equals 0.34 (> 0.05). This means there might be seasonality in the success rate (across months), but it cannot be explained by the monthly number of launched projects.

Project lengths

Kickstarter allows users to run funding for one month or for two months; let’s check if this has any influence on the success of campaigns.

First, let’s calculate length of each campaign

df['deadline'] <- as.Date(df$deadline)
df['launched'] <- as.Date(df$launched)
df['length_days'] <- as.numeric(df$deadline - df$launched)
#max(df$length_days)
df <- df %>% #removing an outlier
  filter(!length_days == 14867)
df <- df %>% # new column indicating whether campaign has 1 month or 2
  mutate(
    length_months = round(length_days / 30)
  )


Let’s check distribution of successful and failed campaigns


Most projects run a one-month campaign. We can see that the success ratio of one-month campaigns is better than that of projects with 1.5 or 2 months of campaigning. This indicates there might be a significant relationship between the length of a campaign and its success.

p <- df %>% 
  filter(length_months <= 2 & length_months > 0) %>% 
  mutate(
    state_en = case_when(
    state == 'failed' ~ 0,
    state == 'successful' ~ 1)) %>%
  group_by(length_months) %>% 
  summarise(success_ratio = mean(state_en),
            n=n()) %>% 
  arrange(desc(success_ratio))
library(DT)  # datatable() comes from the DT package
datatable(p)


1-month-long campaigns seem to have a significantly higher success rate. However, the classes are highly imbalanced.

Countries and Regions

One last thing to check in this dataset are countries. Let’s start with the distribution of projects among countries.


It seems that the vast majority of projects come from the USA. I believe that grouping countries into regions will help to slightly balance this imbalance :D


Looks a little bit better! Let’s see the distribution of successes and failures of campaigns among the regions.


A graph like this is not really clear and it’s hard to draw any conclusions from it. Let’s try to show the success rate per country in the form of a table:
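A sketch of how such a table can be built (the aggregation is assumed, as the original code is not shown):

```r
# Success rate and campaign count per country
df %>%
  group_by(country) %>%
  summarise(n = n(),
            success_rate = round(mean(state == 'successful'), 3)) %>%
  arrange(desc(success_rate))
```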


According to this analysis, the US has not only the highest number of campaigns, but also the highest success rate; on the other hand, European countries (excluding GB) have the smallest success rates. This might indicate a relationship between success and region; however, the distribution is imbalanced and the sample from the US is significantly bigger than any other, which makes it difficult to assess whether such a relationship exists.

Conclusions

According to this extensive EDA, there are multiple interesting patterns to investigate in the data:
  • Relationship between categories and chance of success
  • Trend of growth and decrease of success rate per year
  • Negative correlation between Number of projects per year and success rate
  • Possible seasonality, according to certain months of years when success rate decreases
  • Higher success rate of shorter campaigns
  • Relationship between region and success rate

Recommendations

Taking all of this into account, I recommend the following actions for further research:
  • Fitting a classification model, such as logistic regression, to the data in order to classify successful and failed campaigns, with features such as Month, Length, Main Category, Region and Goal
  • Conducting qualitative research about possible causes of the relationships indicated in this analysis

    THANKS FOR YOUR ATTENTION

    It was really long, but I hope that at least some of the insights are useful!